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Abstract 

Logarithms of determinants of large positive definite matrices appear ubiquitously in machine learning appli¬ 
cations including Gaussian graphical and Gaussian process models, partition functions of discrete graphical models, 
minimum-volume ellipsoids, metric learning and kernel learning. Log-determinant computation involves the Cholesky 
decomposition at the cost cubic in the number of variables, i.e., the matrix dimension, which makes it prohibitive 
for large-scale applications. We propose a linear-time randomized algorithm to approximate log-determinants for 
very large-scale positive definite and general non-singular matrices using a stochastic trace approximation, called the 
Hutchinson method, coupled with Chebyshev polynomial expansions that both rely on efficient matrix-vector multi¬ 
plications. We establish rigorous additive and multiplicative approximation error bounds depending on the condition 
number of the input matrix. In our experiments, the proposed algorithm can provide very high accuracy solutions at 
orders of magnitude faster time than the Cholesky decomposition and Schur completion, and enables us to compute 
log-determinants of matrices involving tens of millions of variables. 


1 Introduction 

Scalability of machine learning algorithms for extremely large data-sets and models has been increasingly the focus 
of attention for the machine learning community, with prominent examples such as first-order stochastic optimization 
methods and randomized linear algebraic computations. One of the important tasks from linear algebra that appears in a 
variety of machine learning problems is computing the log-determinant of a large positive definite matrix. For example, 
serving as the normalization constant for multivariate Gaussian models, log-determinants of covariance (and precision) 
matrices play an important role in inference, model selection and learning both the structure and the parameters for 
Gaussian Graphical models and Gaussian processes 03 USED. Log-determinants also play an important role in a 
variety of Bayesian machine learning problems, including sampling and variational inference 03). In addition, metric 
and kernel learning problems attempt to learn quadratic forms adapted to the data, and formulations involving Bregman 
divergences of log-determinants have become very popular |9[|30|. Finally, log-determinant computation also appears 
in some discrete probabilistic models, e.g., tree mixture models [ [20l Q]| and Markov random fields Oil . In planar 
Markov random fields i26lfl6l inference and learning involve log-determinants of general non-singular matrices. 

For a positive semi-definite matrix B £ numerical linear algebra experts recommend to compute log- 

determinant using the Cholesky decomposition. Suppose the Cholesky decomposition is B = Ll r , then log detf/L) = 

2 y\ ( log The computational complexity of Cholesky decomposition is cubic with respect to the number of vari¬ 
ables, i.e., 0(d 3 )[|] For large-scale applications involving more than tens of thousands of variables, this operation is 
not feasible. Our aim in this paper is to compute accurate approximate log-determinants for matrices of much larger 
size involving tens of millions of variables. 
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1 For sparse matrices with a small tree-width, the complexity of Cholesky decomposition is cubic in the tree-width. 
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Contribution. Our approach to compute accurate approximations of log-determinant for a positive definite matrix uses 
a combination of stochastic trace-estimators and Chebyshev polynomial expansions. Using the Chebyshev polynomi¬ 
als, we first approximate the log-determinant by the trace of power series of the input matrix. We then use a stochastic 
trace-estimator, called the Hutchison method cu, to estimate the trace using multiplications between the input matrix 
and random vectors. The main assumption for our method is that the matrix-vector product can be computed efficiently. 
For example, the time-complexity of the proposed algorithm grows linearly with respect to the number of non-zero en¬ 
tries in the input matrix. We also extend our approach to general non-singular matrices to compute the absolute values 
of their log-determinants. We establish rigorous additive and multiplicative approximation error bounds for approx¬ 
imating the log-determinant under the proposed algorithm. Our theoretical results provide an analytic understanding 
on our Chebyshev-Hutchison method depending on sampling number, polynomial degree and the condition number 
(i.e., the ratio between the largest and smallest singular values) of the input matrix. In particular, they imply that if 
the condition number is 0(1), then the algorithm provides e-approximation guarantee (in multiplicative or additive) in 
linear time for any constant e > 0. 

We first apply our algorithm to obtain a randomized linear-time approximation scheme for counting the number 
of spanning trees in a certain class of graphs where it could be used for efficient inference in tree mixture models 
mm. We also apply our algorithm for finding maximum likelihood parameter estimates of Gaussian Markov random 
fields of size 5000 x 5000 (involving 25 million variables!), which is infeasible for the Cholesky decomposition. Our 
experiments show that our proposed algorithm is orders of magnitude faster than the Cholesky decomposition and 
Schur completion for sparse matrices and provides solutions with 99.9% accuracy in approximation. It can also solve 
problems of dimension tens of millions in a few minutes on our single commodity computer. Furthermore, the proposed 
algorithm is very easy to parallelize and hence has a potential to handle even a bigger size. In particular, the Schur 
method was used as a part of QUIC algorithm ED for sparse inverse covariance estimation with over million variables, 
hence our algorithm could be used to further improve its speed and scale. 


Related work. Stochastic trace estimators have been studied in the literature in a number of applications. mm have 
used a stochastic trace estimator to compute the diagonal of a matrix or of matrix inverse. Polynomial approximations to 
band-pass filters have been used to count the number of eigenvalues in certain intervals on. Stochastic approximations 
of score equations have been applied in liZ7l to learn large-scale Gaussian processes. The works closest to ours which 
have used stochastic trace estimators for Gaussian process parameter learning are 1331 and 0 which instead use 
Taylor expansions and Cauchy integral formula, respectively. A recent improved analysis using Taylor expansions 
has also appeared in (8). However, as reported in Section [5] our method using Chebyshev expansions provides much 
better accuracy in experiments than that using Taylor expansions, and 0 need Krylov-subspace linear system solver 
that is computationally expensive. Il22l also use Chebyshev polynomials for log-determinant computation, but the 
method is deterministic and only applicable to polynomials of small degree. The novelty of our work is combining 
the Chebyshev approximation with Hutchison trace estimators, which allows us to design a linear-time algorithm with 
rigorous approximation guarantees. 


Organization. The structure of the paper is as follows. We introduce the necessary background in Section 2.2 and 
describe our algorithm with approximation guarantees in Section [3] Section [4] provides the proof of approximation 
guarantee of our algorithm, and we report experimental results in Section [5] 


2 Background 

In this section, we describe the preliminaries for our approach to approximate the log-determinant of a positive definite 
matrix. Our approach combines the following two techniques: (a) designing a trace-estimator for the log-determinant 
of positive definite matrix via Chebyshev approximation ED and (b) approximating the trace of positive definite matrix 
via Monte Carlo methods, e.g., Hutchison method fl4l . 
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2.1 Chebyshev Approximation 

The Chebyshev approximation technique is used to approximate analytic function with certain orthonormal polynomi¬ 
als. We use p n (x) to denote the Chebyshev approximation of degree n for a given function / : [—1, 1] —► R: 

n 

f(x) « p n {x) = E CjTj(x), 

3=0 

where the coefficient c, and the i-th Chebyshev polynomial r I',(x) are defined as 

' ^ n 

—r~r V ^ Xk > T o( x k) if * = 0 

n + 1 

k =0 ... 

Ci = n ( 1 ) 

2 _ , 

—— E f(xk) Ti(x k ) otherwise 
n + 1 ' 

/c—0 

T i+ i(a;) = 2xTi(x) - T-_i(a:) for i > 1 (2) 


where x k = cos f° r k = 0,1, 2,... n and T 0 (x) = 1, Ti(x) = x. 

Chebyshev approximation for scalar functions can be naturally generalized to matrix functions. Using the Cheby¬ 
shev approximation p n (x) for function f(x) = log( 1 —x) we obtain the following approximation to the log-determinant 
of a positive definite matrix B £ R. dxd : 

d 

log det B = log det (I — A) = ^ log(l — A^) 

2 = 1 

d d n 

~ YpM = E E C 3 r P'j{^i) 

2=1 2=1 j— 0 

n d n 

= E c 3 E t 3 = E c J tr - 

j =0 i—1 j=0 

where A = I — B has eigenvalues 0 < Ai,..., Xd < 1 and the last equality is from the fact that Yli -1 P(^i) = 
tr(p(A)) for any polynomial p^ We remark that other polynomial approximations, e.g., Taylor, can also be used to 
approximate log-determinants. We focus on the Chebyshev approximation in this paper due to its superior empirical 
performance and rigorous error analysis. 


2.2 Trace Approximation via Monte-Carlo Method 


The main challenge to compute the log-determinant of a positive definite matrix in the previous section is calculating the 
trace of T 7 (A) efficiently without evaluating the entire matrix A k . We consider a Monte-Carlo approach for estimating 
the trace of a matrix. First, a random vector z is drawn from some fixed distribution, such that the expectation of z 1 /1z 
is equal to the trace of A. By sampling m such i.i.d random vectors, and averaging we obtain an estimate of tr(A). 

It is known that the Hutchinson method, where components of the random vectors Z are i.i.d Rademacher random 
variables, i.e., Pr(+1) = Pr(—1) = f, has the smallest variance among such Monte-Carlo methods lfl4l I51. It has 
been used extensively in many applications mm®. Formally, the Hutchinson trace estimator tr m (A) is known to 
satisfy the following: 


E 


tr m (A) 


1 

nn 


E ^ Az * 

2=1 


tr(A) 


2 tr(-) denotes the trace of a matrix. 
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Var [tr m (A)] = 2 


Note that computing z T Az requires only multiplications between a matrix and a vector, which is particularly appealing 
when evaluating A itself is expensive, e.g., A = B k for some matrix B and large k. Furthermore, for the case 
A = Tj (X), one can compute z T Tj (X) z more efficiently using the following recursion on the vector Wj = Tj(X) z: 


which follows directly from 


ivj +1 = 2 Xwj — Wj- 1 , 


3 Log-determinant Approximation Scheme 


Now we are ready to present algorithms to approximate the absolute value of log-determinant of an arbitrary non¬ 
singular square matrix C. Without loss of generality, we assume that singular values of C are in the interval [cr m i n , cr max ] 
for some cr m i n , er max > 0, i.e., the condition number n(C) is at most 
are not sensitive to tight knowledge of cr n 
suffice. 


:= cr B 


t/ofnin- The proposed algorithms 
, <r max , but some loose lower and upper bounds on them, respectively. 


We first present a log-determinant approximation scheme for positive definite matrices in Section 3.1 and that for 


general non-singular ones in Section 3.2 later. 


3.1 Algorithm for Positive Definite Matrices 

In this section, we describe our proposed algorithm for estimating the log-determinant of a positive definite matrix 
whose eigenvalues are less than one, i.e., cr max < 1. It is used as a subroutine for estimating the log-determinant of a 
general non-singular matrix in the next section. The formal description of the algorithm is given in what follows. 


Algorithm 1 Log-determinant approximation for positive definite matrices with er max < 1 

Input: positive definite matrix B £ R dxd with eigenvalues in [4,1 <5] for some S > 0, sampling number m and 

polynomial degree n 
Initialize: A <— I — B,T <— 0 

for * = 0 to n do 

Ci 4— i-th coefficient of Chebyshev approximation for log(l — ^ 1 ~ 2 ^ a+1 ) 

end for 

for i = 1 to m do 

Draw a Rademacher random vector v and u 4— Cq v 
if n > 1 then 

wo 4— v and wi 4— Av 

U 4 — 11 -f- ClAv 

for j = 2 to n do 

w 2 4— 2A\Vi — w 0 

U <- U + c 3 W 2 

wo 4— wi and wi 4— w 2 

end for 
end if 

r 4 — r + v 1 u/m 

end for 
Output: T 


We establish the following theoretical guarantee of the above algorithm, where its proof is given in Section 4.3 


Theorem 1 Given e, £ £ (0, 1), consider the following inputs for Algorithm^ 


4 









• B £ R dxd be a positive definite matrix with eigenvalues in [<5, 1 — 5\ for some 6 £ (0,1/2). 

• to > 54e -2 log 


• n > 


I log(2 (1/5 — 1)) 

1 log (1/1 — (5) , 




= o 


(\/T lo g (??)) 


Then, it follow’s that 


Pr [ |logdet B — P| < e |logdet B\} > 1 — ( 
where T is the output of Algorithm [T] 

The bound on polynomial degree n in the above theorem is relatively tight, e.g., it implies to choose n = 14 for 
S = 0.1 and e = 0.01. However, our bound on sampling number to is not, where we observe that to ss 30 is sufficient 
for high accuracy in our experiments. We also remark that the time-complexity of Algorithm[I]is ()(rnn\\ /!||u), where 
||B||o is the number of non-zero entries of B. This is because the algorithm requires only multiplications of matrices 
and vectors. In particular, if to, n = 0(1), the complexity is linear with respect to the input size. Therefore, Theorem 
[TJimplies that one can choose to, n = 0(1) for e-multiplicative approximation with probability 1 - ( given constants 
£,C > 0. 

3.2 Algorithm for General Non-Singular Matrices 

Now, we are ready to present our linear-time approximation scheme for the log-determinant of general non-singular 
matrix C, through generalizing the algorithm in the previous section. The idea is simple: run Algorithm |T] with 
normalization of positive definite matrix C T C. This is formally described in what follows. 


Algorithm 2 Log-determinant approximation for general non-singular matrices 


Input: matrix C £ R dxd with singular values are in the interval [er, 
number to and polynomial degree n 

-C T C, S £- 


min, ^max] for Some (7 


min •> v max 


> 0, sampling 


Initialize: B< - 3 — , ^ 2 , 

<i„+C cr » i „+°L_- 

T <— Output of Algorithm [T| for inputs B. to, n, S 

Output: r <- (r + dlog (cr^ in + cr^ ax )) /2 


Algorithm 2 is motivated to design from the equality log | det C | = | log det C T C. Given non-singular matrix C, 
one need to choose appropriate er max , rx,,, jT1 to run it. In most applications, CT max is easy to choose, e.g., one can choose 


= >/||C||i||C'|| 0O 


or one can run the power iteration na to estimate a better bound. On the other hand, cr m i n is relatively not easy to 


obtain depending on problems. It is easy to obtain in the problem of counting spanning trees we studied in Section 3.3 


and it is explicitly given as a parameter in many machine learning log-determinant applications ED. In general, one 
can use the inverse power iteration ca to estimate it. Furthermore, the smallest singular value is easy to compute for 
random matrices Il29ll28ll and diagonal-dominant matrices ei eh. 

The time-complexity of Algorithm [ 2 ] is still 0(rnn\\C\\f) instead of 0(mn\\C T C\\f) since AI go r i t h 111 [T] re q u i re s 
multiplication of matrix C T C and vectors. We state the following additive error bound of the above algorithm. 


Theorem 2 Given e, ( £ (0,1), consider the following inputs for Algorithm]^ 

• C £ R dxd be a matrix such that singular values are in the interval [<r m i n , cr max ] for some cr m i n , <r max > 0. 

• to > A4 (£, , C I and n > fif (£, ), where 

V 1 V min ’ ^ / \ 7 0" m i n / 
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M(e,k, C) 


7V(e, re) : 


= JT (log(l + K 2 )) 2 log^ 
log (f (^?+T - l) 108 



V^+l+l N 
V2k 2 +1-1 ) 



Then, it follows that 


Pr [ |log (|det (71) — T| < ed ] > 1 — £ 


where T is the output of Algorithm [2] 

Proof. The proof of Theorem[2]is quite straightforward using Theorem[l]for B with the facts that 

2 log | det Cl = log det B + d log (cr^ in + a^ ax ) 

and | logdet S| < dlog fl + ■ 

\ min / [ 

We remark that the condition number cr max /<7 m in decides the complexity of Algorithm El As one can expect, 
the approximation quality and algorithm complexity become worse for matrices with very large condition numbers, as 
the Chebyshev approximation for the function log x near the point 0 is more challenging and requires higher degree 
approximations. 

When er max > 1 and cr m i n < 1, i.e. we have mixed signs for logs of the singular values, a multiplicative error bound 
(as stated in Theorem[l]i can not be obtained since the log-determinant can be zero in the worst case. On the other hand, 
when er max < 1 or <7 m i n > 1, we further show that the above algorithm achieves an e-multiplicative approximation 
guarantee, as stated in the following corollaries. 

Corollary 3 Given e, ( £ (0,1), consider the following inputs for Algorithm[2j 

• C £ R dxd be a matrix such that singular values are in the interval [<j m i n , cr max ] for some cr max < 1. 

• m > M (elog-J— ,? mi2 sC) 

• n > AT (elog —!—, ) 

\ Umax (-'min J 

Then, it follows that 


Pr [ |log |det C| — T| < £ |log |det C\ |] > 1 — C 
where T is the output of Algorithm [2] 

Corollary 4 Given s, f £ (0,1), consider the following inputs for Algorithm[2j 

• C £ R dxd be a matrix such that singular values are in the interval [<j m i n , cr max ] for some cr m i n > 1. 

• m> M (e log Cr m in, 

• > A/ - (elogcr min , 

\ u min J 

Then, it follows that 


Pr [ |log det C — T| < e log det C\ > 1 — ( 
where T is the output of Algorithm[2] 

The proofs of the above corollaries are given in the supplementary material due to the space limitation. 
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3.3 Application to Counting Spanning Trees 

We apply Algorithm [2] to a concrete problem, where we study counting the number of spanning trees in a simple 
undirected graph G = (V,E) where there exists a vertex i* such that ( i*,j ) £ E for all j £ V \ {**}. Counting 
spanning trees is one of classical well-studied counting problems, and also necessary in machine learning applications, 
e.g., tree mixture models EOl fTl. We denote the maximum and average degrees of vertices in V \ {% } by A max and 
A avg > 1, respectively. In addition, we let L{G) denote the Laplacian matrix of G. Then, from Kirchhoff’s matrix-tree 
theorem, the number of spanning tree t(G) is equal to 


t(G) = det L(i*), 


where L(i*) is the (|V| — 1) x (|V| — 1) sub matrix of L(G) that is obtained by eliminating the row and column 
corresponding to i*. Now, it is easy to check that eigenvalues of L{i*) are in [1, 2A max — 1], Under these observations, 
we derive the following corollary. 


Corollary 5 Given 0 < e < A 2 _ 1 , ( £ (0, 1), consider the following inputs for Algorithm |2j 


• C = L(i*) 

• m>M ( ^f^ ,2A max - 1,C) 

• n>A/'( E(A 7~ 1) ,2A 1MX -l) 

Then, it follows that 

Pr[|logr(G)-r| <elogr(G)] >1-C 


where T is the output of Algorithm [2] 


The proof of the above corollary is given in the supplementary material due to the space limitation. We remark that 
the running time of Algorithm [2] with inputs in the above theorem is 0(nmA avg \V\). Therefore, for e, ( = f2(l) and 
A avg = 0(1), i.e., G is sparse, one can choose n,m = 0(1) so that the running time of Algorithm [I] is O (| V \). 


4 Proof of Theorem [1] 


In order to prove Theorem[l] we first introduce some necessary background and lemmas on error bounds of Chebyshev 
approximation and Hutchinson method we introduced in Section 2.1 and Section 2.2 respectively. 


4.1 Convergence Rate for Chebyshev Approximation 

Intuitively, one can expect that the approximated Chebyshev polynomial converges to its original function as degree n 
goes to oo. Formally, the following error bound is known 171 [32l . 


Theorem 6 Suppose f is analytic with |/(z)| < M in the region bounded by the ellipse with foci ±1 and major and 
minor semiaxis lengths summing to I\ > 1. Let p n denote the interpolant of f of degree n in th Chebyshev points as 
defined in section \2.1\ then for each n > 0, 


max I fix) 

xe[-t,i] 


Pn(x) | < 


4 M 

(K - 1) I< n 


To prove Theorem|I]and Theorem[2j we are in particular interested in 


f(x) = log(l — x), for x £ [5,1 — <5], 
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Since Chebyshev approximation is defined in the interval [—1,1], e.g., see Section 2.1 one can use the following linear 
mapping g : [<5,1 — 5] —> [—1,1] so that 


max |(/ o g)(x) - p„(x)\ 
x6[-l,l] 

= max r \f (x) ~ (p n o g~ 1 )(x)\ 
cce [<5,1—<5] 1 1 

For notational convenience, we use p n {x) to denote ( p n o ^ _1 )(a:) in what follows. 

We choose the ellipse region, denoted by Ek, in the complex plane with foci ±1 and its semimajor axis length is 

1 / (1—<5) where fog is analytic on and inside. The length of semimajor axis of the ellipse is equal to (1/(1 — 5))“ — 1. 
Hence, the convergence rate K can be set to 


K = 


1 


1-5 


1 


1-5 


_ 72-5 + 75 
~ 72 ^ 5-75 > 


The constant M can be also obtained using the fact that |logz| = |log \z\ + % arg (z)| < \j (log \z\) 2 + t r 2 for any 
z £ C as follows: 

max \(f°g)(z) \ = max \ (f ° g)(z)\ = max |log (1 - g(z))\ 


Z&£k 


z&£k 


z&£k 


< max \/(log|l-3(z)|) 2 + 7r 2 

zes K v 


1 


=7log 2 - - 1 +7T 2 < 5log 2 - - 1 := M. 


1 


Hence, for x € [5,1 — 5], 


I log (1 - x) -p n {x)\ < 


20 log (2 (j — l)) 


(.K - 1 )K n 

Under these observations, we establish the following lemma that is a ‘matrix version’ of Theorem[6] 

Lemma 7 Let B £ K dX£i be a positive definite matrix whose eigenvalues are in [5,1 — 5] for 5 £ (0,1/2). Then, it 
holds that 

| logdetS - tr(p n (I - B))\ < ?? dIog ( 2 G “ *)) 


(K - 1) K n 


where K = 


Proof. Let Ai, A 2 , ■ ■ • , A d £ [5,1 — 5] be eigenvalues of matrix A = I — B. Then, we have 

|logdet(/ - A) - tr {p n (A))\ = |tr (log(7 - A)) - tr (p n (A))\ 

d d 

= ^ 1 °g( 1 -A i )-^p„(A i ) 


i =1 
d 


< Y l 108 ^ 1 “ Ai ) I 


i—1 

d 


^ Y 


20 log (2 (| — l)) 
(K - 1) K n 


where we use Theorem[6] This completes the proof of Lemma[7] 
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4.2 Approximation Error of Hutchinson Method 


In this section, we use the same notation, e. 
trace estimator tr m (-) defined in Section 2.2 


g., f,p n , used in the previous section and we analyze the Hutchinson’s 
To begin with, we state the following theorem that is proven in J24j. 


Theorem 8 Let A £ be a positive definite or negative definite matrix. Given eg, (o G (0,1), it holds that 


Pr [|tr m (A) - tr(A)| < £ 0 tr(A)] > 1 - Co 


if sampling number m is no smaller than 6 £ 0 2 log(2/Co)- 


The theorem above provides a lower-bound on the sampling complexity of Hutchinson method, which is indepen¬ 
dent of a given matrix A. To prove Theorem[l] we need an error bound on tr m (p,,(,4)). However, in general we may 
not know whether or not p n (A) is positive definite or negative definite. We can guarantee that the eigenvalues of p n (A) 
will be negative using the following lemma. 


Lemma 9 p n (x) is a negative-valuedpolynomial in the interval [<5,1 — 6] if 


20log (2(^-1)) 

{K- l)K n 


< log 



where we recall that K = 

\/2 — o — v o 

Proof. From Theorem[6] we have 


max p n (x) = max f(x) + (p„(x) - f(x)) 

[< 5 , 1 —< 5 ] [ 5 , 1 — 5 ] 

< maXj f(x) + max | p n {x) - f{x)\ 

, ^ 20 log (2(1 — 1)) 

< log 1 - 6 + — , v / - A < 0, 

_ B v (A - 1) K n ~ 

log(l — 6 ). This completes the proof of Lemma |o| 


where we use 20 < — 

4.3 Proof of the Theorem [T] 

Now we are ready to prove Theorem |T| First, one can check that sampling number n in the condition of Theorem |T] 
satisfies 

201°g(2G —1)) <; / M 

(K- pjf“ “ 2 B \l - S) 

Hence, from Lemma[9j it follows that p n (A) is negative definite where A = I — B and eigenvalues of B are in [5,1 — <5], 
Hence, we can apply Theorem|8]as 


Pr 


|tr (p n (A)) - tr m (p n (A))\ < - jtr (p„(A))| 


>i-C, 


(4) 


for to > 54e 


' log ^4 ). In addition, from Theorem|7j we have 


|tr (p n (A))\ - |logdetH| < |logdetB — tr (p n (A))\ 

20d log (2(1/<5 - 1)) 


< 


(K — 1) K n 


< tjdlog 


1 


1-6 


< - | log detf?|, 
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Figure 1: Performance evaluations of Algorithm [2] and comparisons with other ones: (a) running time vs. dimension, 
(b) relative accuracy, (c) comparison in running time with Cholesky decomposition and Schur complement and (d) 
comparison in accuracy with Taylor approximation in lf33l . The relative accuracy means a ratio between the absolute 
error of the output of an approximation algorithm and the actual value of log-determinant. 


which implies that 


tr < 


■{Pn(A))\ < + l ) Il°gdetf?| < - |logdeti?|. 

Combining <[3]), (|4j) and Q leads to the conclusion of Theorem[l]as follows: 


1 - C < Pi' |tr (p„(A)) - tr m (p„(A))| < - |tr (p n {A))\ 


< Pr 


|tr (p n (A)) - tr m (p„(A))| < - |logdetB| 


< Pr[|tr {p n {A)) - tr m (p n (A))\ + | logdetB - tr (p n (A)) 

< 11logdetf?| + | |logdetf?|] 

< Pr [|logdetB — tr m (p n (A))\ < £ |logdetS|] 

= Pr [jlogdetf? — T| < e |logdetf?|], 


(5) 


where T = tr m (p n (A)). 

5 Experiments 

We now study our proposed algorithm on numerical experiments with simulated and real data. 

5.1 Performance Evaluation and Comparison 

We first investigate the empirical performance of our proposed algorithm on large sparse random matrices. We generate 
a random matrix C € R dxd , where the number of non-zero entries per each row is around 10. We first select five non¬ 
zero off-diagonal entries in each row with values uniformly distributed in [—1,1]. To make the matrix symmetric, we 
set the entries in transposed positions to the same values. Finally, to guarantee positive definiteness, we set its diagonal 
entries to absolute row-sums and add a small weight, 10 -3 . 

Figure[l](a) shows the running time of Algorithm [ 2 ] from d = 10 3 to 3 x 10 7 , where we choose m 10, n = 15, 
fj min = 10- 3 and o' max = 11C711 1 . It scales roughly linearly over a large range of sizes. We use a machine with 3.40 
Ghz Intel 17 processor with 24 GB RAM. It takes only 500 seconds for a matrix of size 3 x 10' with 3 x 10 s non-zero 
entries. In Figure [I] (b), we study the relative accuracy compared to the exact log-determinant computation up-to size 
3 x 10 4 . Relative errors are very small, below 0.1%, and appear to only improve for higher dimensions. 

Under the same setup, we also compare the running time of our algorithm with other algorithm for computing 
determinants: Cholesky decomposition and Schur complement. The latter was used for sparse inverse covariance 
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(a) (b) 


Figure 2: GMRF interpolation of Ozone measurements: (a) original sparse measurements and (b) interpolated values 
using a GMRF with parameters fitted using Algorithm [2] 

estimation with over a million variables fl3ll and we run the code implemented by the authors. The running time of 
the algorithms are reported in Figure |T] (c). The proposed algorithm is dramatically faster than both exact algorithms. 
We also compare the accuracy of our algorithm to a related stochastic algorithm that uses Taylor expansions (33). 
For a fair comparison we use a large number of samples, n = 1000, for both algorithms to focus on the polynomial 
approximation errors. The results are reported in Figure|T](d), showing that our algorithm using Chebyshev expansions 
is superior in accuracy compared to the one based on Taylor series. 


5.2 Maximum Likelihood Estimation for GMRF 


GMRF with 25 million variables for synthetic data. We now apply our proposed algorithm for maximum likeli¬ 
hood (ML) estimation in Gaussian Markov Random Fields (GMRF) ll25l . GMRF is a multi-variate joint Gaussian 
distribution defined with respect to a graph. Each node of the graph corresponds to a random variable in the Gaus¬ 
sian distribution, where the graph captures the conditional independence relationships (Markov properties) among the 
random variables. The model has been extensively used in many applications in computer vision, spatial statistics, 
and other fields. The inverse covariance matrix J (also called information or precision matrix) is positive definite and 
sparse: Jij is non-zero only if the edge { i, j} is contained in the graph. 

We first consider a GMRF on a square grid of size 5000 x 5000 (with d = 25 million variables) with precision 
matrix J £ W lxd parameterized by p, i.e., each node has four neighbors with partial correlation p. We generate a 
sample x from the GMRF model (using Gibbs sampler) for parameter p = —0.22. The log-likelihood of the sample is: 
logp(x|p) = logdet J(p) — x T J(p)x + G, where J(p) is a matrix of dimension 25 x 10 6 and 10 s non-zero entries, 
and G is a constant independent of p. We use Algorithm[2]to estimate the log-likelihood as a function of p, as reported 
in Figure[3] The estimated log-likelihood is maximized at the correct (hidden) value p = —0.22. 


GMRF with 6 million variables for Ozone data. We also consider GMRF parameter estimation from real spatial 
data with missing values. We use the data-set from (3j that provides satellite measurements of Ozone levels over 
the entire earth following the satellite tracks. We use a resolution of 0.1 degrees in lattitude and longitude, giving a 
spatial held of size 1681 x 3601, with over 6 million variables. The data-set includes 172 thousands measurements. 
To estimate the log-likelihood in presence of missing values, we use the Schur-complement formula for determinants. 


Let the precision matrix for the entire held be J = 


where subsets x n and x- denote the observed and 


Jo Jo,: 

\Jz,o Jz , 

unobserved components of x. The marginal precision matrix of x Q is J a = J a — J 0 . z Jj ] Jz,o- Its log-determinant is 
computed as log(det( J Q )) = log det( J) — log det( J 2 ) via Schur complements. To evaluate the quadratic term x' 0 J 0 x a 
of the log-likelihood we need a single linear solve using an iterative solver. We use a linear combination of the thin- 
plate model and the thin-membrane models |f25l , with two parameters a and /3: J = al + (/ 3)Jt p + (1 — f3)Jtm and 
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Figure 3: Log-likelihood estimation for hidden parameter p for square GMRF model of size 5000 x 5000. 


obtain ML estimates using Algorithm [2] Note that o mm (./) = a. We show the sparse measurements in Figure [5] (a) 
and the GMRF interpolation using fitted values of parameters in Figure[2](b). 


6 Conclusion 

Tools from numerical linear algebra, e.g. determinants, matrix inversion and linear solvers, eigenvalue computation and 
other matrix decompositions, have been playing an important theoretical and computational role for machine learning 
applications. While most matrix computations admit polynomial-time algorithms, they are often infeasible for large- 
scale or high-dimensional data-sets. In this paper, we design and analyze a high accuracy linear-time approximation 
algorithm for the logarithm of matrix determinants, where its exact computation requires cubic-time. Furthermore, it 
is very easy to parallelize since it requires only (separable) matrix-vector multiplications. We believe that the proposed 
algorithm will find numerous applications in machine learning problems. 
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A Proof of Corollary [3] 

For given e < *— , %—set e 0 = § log ( 2 ) ■ Since all eigenvalues of C T C are positive and less than 1, it follows 

°Sl^max / Z \ ff max / 

that 


jlogdet (C T C) | = 


Y lo § Ai 


j=i 


> dlog ( 




where A i are i-th eigenvalues of C T C. Thus, 


£0 = 2 lo S 


1 


< 

“ 2 


£ |logdet (jTC | _ |log (|det C|)| 


= e- 


We use £0 instead of £ from Theorem[2] then following 

Pr [ |log (|det C\) — P| < £ |log (|det C|)| ] > 1 — C 


holds if m and n satifies below condition. 


B Proof of Corollary [4] 

Similar to proof of Corollary[3j set £ 0 = | log cr 2 nin . Since eigenvalues of C T C are greater than 1, 

jlogdet (C T C) | > dlogcr^ in 

and £o < £ ^ ct C ^ . From Theorem]^] we substitute £o into £ and 

Pr [ |logdet C — T| < £ |logdet C\ ] > 1 — C 
holds if in and n satifies below condition. 


C Proof of Corollary [5] 

For £q = £(A avg — l)/2, £ £ (0,1), Theorem [2] provides the following inequality: 


Pr (| log det L(i*) -T\< £ 0 (|F| -!))>!-(• 


Observe that since vertex i* is connected all other vertices, the number of spanning tree, i.e., det L(i*), is greater than 

2(|V|- 1 )(A avg -t)/2_ pf enc6i we Jj ave 

Pr(|logdetL(i*)-r|<£o(|V|-l)) 

< Pr (| log det L(i*) — P| < £ log det L(i*)). 


This completes the proof of Corollary [5] 
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